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Abstract 


The main contribution of this article is a new prior distribution over directed acyclic graphs, 
which gives larger weight to sparse graphs. This distribution is intended for structured Bayesian 
networks, where the structure is given by an ordered block model. That is, the nodes of the graph 
are objects which fall into categories (or blocks); the blocks have a natural ordering. The presence 
of a relationship between two objects is denoted by an arrow, from the object of lower category to 
the object of higher category. The models considered here were introduced in Kemp et al. for 
relational data and extended to multivariate data in Mansinghka et al. Q. The prior over graph 
structures presented here has an explicit formula. The number of nodes in each layer of the graph 
follow a Hoppe Ewens urn model. 

We consider the situation where the nodes of the graph represent random variables, whose 
joint probability distribution factorises along the DAG. We describe Monte Carlo schemes for 
finding the optima l aposteriori structure given a data matrix and compare the performance with 
Mansinghka et al.l and also with the uniform prior. 


1 Introduction 

1.1 The Block Model and Ordered Block Model 

The block model and ordered block model were introduced by Kemp et al. 8(2004) for analysing 
relational data. This is an interesting and versatile idea for the situation where classification of 
individuals is to be inferred from observing how the individuals relate to each other. Consider d 
individuals, represented as nodes on a graph. For example, on a humorous note, Kemp et. al. suggest 
an ecclesiastical gathering where the participants are dressed informally, so that the observer cannot 
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infer the class structure of the group simply from observing each individual separately. Each has an 
unknown status; vicar, bishop, archbishop, or any other status within the Anglican hierarchy. The 
observer is not an ecclesiastical expert and does not know much about the hierarchical structures 
within Anglicanism. In particular, he does not know a priori the number of classes. He notes how 
these individuals interact with each other. If one individual behaves deferentially towards another, it 
may be assumed that he is from a class of lower order; a directed arrow is inserted from the individual 
of lower order to the individual of higher order, thus producing a directed acyclic graph (DAG), where 
each individual is represented as a node. It may be that the individuals present at the event do not 
all know each other; a vicar who meets a bishop will show deference only if he knows the bishop; he 
therefore does this with probability p, where 0 < p < 1. 

To assess the class structure, one starts with a prior distribution over classifications and then 
updates to a posterior distribution, given the observed information of the DAG. Either there is a 
directed edg e, or there is no edge, or else no interaction was observed and the edge status is unknown. 


Kemp et al.l propose a Hoppe urn scheme model to construct a prior distribution over the class struc¬ 


ture. Gonditioned on the class structure, they propose a Beta distribution for the edge probability 
between nodes in different classes, which leads to a joint prior distribution over classifications and 
DAGs. From this, the posterior distribution over classification, given the DAG, may be computed. 

Two models are proposed; the ordered block model where there is a hierarchy and the classes have 
a distinct ordering and the block model, where there is not a hierarchical structure between classes. In 
this article, we only consider the first of these; the ordered block model. 

These mode ls are used for wide ranging problems of relational data. Another example developed 
by iKemp et al.l is to infer the societal structure of aboriginal tribes. They also use the model for 
experiments in cognition, simulating the process of human learning. Ghildren are given a collection 
of d objects, each identical in appearance. Some belong to category A, the others to category B, 
although the children are not told the number of different categories in advance. In one experiment, 
when an object from category A touches one from category B, the category B object may light up. 
There is a probability p that a given category A object activates a given category B object. Activation 
is represented by a DAG, from which the classification of the objects is to be inferred. The ordered 
block model is relevant when objects from category A do not light up; the block model is relevant when 
the action is reciprocal. 


1.2 Multivariate Data and Classification of Variables 

Multivariate data comes in the form of an nxd data matrix x, where each row represents an independent 
instantiation of a random d-vector with probability distribution Throughout, x will denote 

the matrix of instantiations and X to the underlying random matrix. Also, we use X = (Vi,... ,Xfi) 
to denote the random vector (taken as a row vector, in accordance with the way that data is presented 
in a data matrix). 

Much work has been carried out concerning classification of instantiations, where one (or several) 
of the variables is a class variable and, given values taken by the other variables, the classification 
of the instantiation should be inferred. There is less work concerning classification of variables. An 
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important contribution in this direction, where the number of classes is a priori unknown, is found in 
Mansinghka et al. [^, who extend the work of Kemp et al. to accommodate the situation where the 
individuals in question, which comprise the node set of the graph, are random variables. 

In many ‘real world’ problems, variables may be grouped into classes which indicate the nature 
of their probabilistic relations. For example, in the QMR-DT network, introduced by Shwe et al. 


m variables fall into two classe^ diseases and symptoms, where various diseases may cause various 


symptoms. In Mansinghka et al. ^ , the HEPAR II network, due to Onisko et al. [9[] is considered. The 



DAG is shown in Figured! It is a Bayesian Network for analysing liver diseases where the variables can 
be divided into three classes; risk factors, diseases, symptoms. The conditional probabilities for the 
HEP AR II network have been learned from a database of medical cases. This example is considered 
in by iMansinghka et al.l when evaluating their algorithm. 

Another example is the problem of finding genome pathways. Paced with n instantiations (where n 
is of the order of thousands) of the expression levels of d genes (where d is of order tens of thousands), 
the aim is to classify the genes and discover the class of regulators, which are responsible for controlling 
the activation of other genes, and which may influence each other. 

The aim of the data analysis is therefore two-fold: to find a suitable classification of the variables 
and also structure learning] namely, to find a graph which encodes the direct influences that the 
variables have on each other, where the layers of the graph correspond to the classes of an ordered 
block model. 


1.3 Bayesian Networks and Structure Learning 

A collection of random variables {Xi,... ,X(i} can be ordered in d ways and each gives rise to a 
factorisation: for a permutation cr of {1,... ,d}, 

j=i 

where, with the ordering a, for each j, Pa(Ao-(j)) c ..., =: PJ and VJ denotes the 

a-predecessors of For the ordering a, for each j, the set Fa(a{j)) is taken as the smallest 

subset of Vj such that Equation ([1]) is true. A factorisation may be expressed by a Directed Acyclic 
Graph (DAG, also known as acyclic digraph, ADG), a directed acyclic graph which has a directed 
arrow j ^ k and only if Xj e Pa(A:). Such a factorisation, together with the corresponding DAG, is 
known as a Bayesian Network (or BN). 
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For multivariate data with large numbers of variables, it is usually not feasible to learn the empirical 

probability distribution; if variable Xj has kj possible states for j = 1,... ,d and no assumptions may 

d 

be made about the independence structure, then a total of ^ values need to be stored. Even 

i=i 

if each variable only has two possible states, this value is 2^^ - 1, which is growing exponentially in 
the number of variables. If the distribution may be factorised according to a BN where the DAG is 
sparse in the sense that, for each j e {!,... ,(i} the set Pa(j) is reasonably small, the storage may 
be reduced considerably. Furthermore, if there is a substantial independence structure that can be 
exploited, there are fewer parameters to be estimated from the n data points and hence the estimation 
is more accurate. 

The key result for Bayesian Networks is that a Directed separation (D-separation) statement 
for the DAG implies that the corresponding conditional independence statement for the probability 
distribution is true. It is computationally straightforward to verify graphical separation; the whole 
point of expressing a probability distribution as a Bayesian Network is to use graphical separation 
algorithms to infer those conditional independence statements for the random variables that correspond 
to graphical separation statements. 

Unfortunately, in almost all practical situations, there does not exist a DAG such that the converse 
is true; it is not possible, in general, to find a DAG such that D-connection in the DAG implies 
the corresponding conditional independence statement. Therefore, the main thrust of the Bayesian 
Network approach is (a) to find the BN with the sparsest graph along which the probability distribution 
factorises and (b) conclude conditional independence when there is D-separation. Those conditional 
independence statements which cannot be expressed in terms of graphical separation are beyond the 
scope of the Bayesian Network approach. 

Representing a probability distribution as a BN is often particularly efficient if the DAG represents 
a causal structure that is known in advance. If there are known cause-to-effect relationships between 
the variables, then the variables are ordered in such a way that a cause has a lower order than an 
effect. It is well established, though, that it is, in general, not possible to infer causality simply from 
data; if a DAG is learned from data, directed arrows do not in and of themselves suggest that a ‘cause’ 
to ‘effect’ relationship may be present. 

In some situations, it may happen that there exists a BN such that the set of graphical separation 
statements for the DAG is equivalent to the set of conditional independence statements for the prob¬ 
ability distribution. In such a situation, the DAG is said to be faithful to the probability distribution. 
Faithfulness is extremely rare, but it can occur in cases where all the variables influencing the system 
are observable and where there is a natural causal ordering between the variables. 

Structure learning algorithms fall broadly into two categories; constraint based and search and 
score. There are also hybrid algorithms which are a mixture of the two, which try to take the best 
features from both. While constraint based algorithms are, in general, more economical, they have well 
established problems. These are mostly connected with lack of faithful graph. A DAG contains an edge 
between X and Y if and only if there is no subset S which D-separates X and Y, but those algorithms 
that work on the principle of removing an edge X ~ Y whenever a conditioning set S is found such 
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that X ± ylS" (practically all constraint based algorithms) are guaranteed to perform badly in ‘real 
world’ situations, since only synthetic data simulated from a distribution with a faithful graph has 
any serious chance of coming from a distribution which has a faithful graph (they perform well in such 
situations, which usually form the basis of the criteria used for evaluation). Other minor difficulties 
(minor because these algorithms perform badly even if there is a perfect oracle) are connected with 
the power of the conditional independence tests when there is a large set of conditioning variables. 
Edges are wrongly removed because ‘do not reject’ independence is wrongly taken to mean ‘accept’ 
independence, hence deletion of an edge; weak tests lead to false negatives, which contradict other 
‘reject conditional independence’ statements derived from tests which are more reliable, because the 
statement ‘reject the null hypothesis’ conforms to the principles of statistical theory, while taking ‘do 
not reject the null hypothesis’ to mean ‘accept the null hypothesis’ violates these principles. 

Search and score methods do not have these difficulties. A score function is defined on the space 
of DAGs and the aim is to find the DAG which gives the highest score. They are more rubust, but 
computationally substantially more expensive. 

Exhaustive search by scoring every structure is usually not computationally feasible; a good search 
and score algorithm will try to find favourable regions in the search space. Since not all structures are 
visited, there are no guarantees that the optimal graph will be chosen; the aim is to locate a reasonable 
structure that encapsulates the main features of the independence relations between the variables. 

One way to score the graphs is to consider a prior distribution over graph structures Pg, the 
Gooper-Herskovits graph likelihood function Px|g) which is the probability distribution over X given 
the graph structure. The posterior is proportional to: PgPx|g a-iid this may be used as a score function. 
The structure which maximises this score maximises the posterior distribution and hence we call it 
the maximum aposteriori structure (MAPS). The Gooper-Herskovits likelihood is well known, with a 
convenient closed form. To complete the score function, a prior distribution Pg over structures has to 
be established. 

1.4 Distributions over Graph Structures 

The choice of prior distribution Pg is clearly important for search-and-score based algorithms. One 
choice is the uniform prior, for d nodes, the probability of choosing a particular DAG G is: Pg(G) = 
where is the well known number of DAGs with d nodes. Kuipers and Moffa 11(2013) show how to 
sample a DAG from a uniform distribution. 

A straightforward way to generate a random DAG, where the distribution is not uniform, is: firstly, 

1 

take a random permutation of the d nodes a, each permutation with probability —. Next, randomly 

d\ 

generate an upper triangular matrix D where elements D^j = 0 for i> j and the other elements of the 
matrix are assigned the value 1 with probability p and 0 with probability 1 - p. The graph then has 
directed edge a{i) a{j) if and only if Dij = 1. 

Conditioned on the permutation a, the probability of obtaining a DAG with exactly k < ^ ^ ^ 

edges is p^{l - Computing the marginal probability of a given graph structure is a 

difficult problem here; the number of different permutations giving rise to the same graph depends on 
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the number of different nodes used, which depends on the number of connected components; 

a! 

G has k edges; n{G) is number of permutations where this graph is possible. 

There are various approaches to the construction of prior distributions Pg corresponding to practical 
prior information about the structure of the DAG. In Q], priors of the form 

Pg(G) oc 

where the functions {fi{G)} are called concordance functions. These functions are constructed from 
prior information about features that the graph is likely to possess; e.g. individual edges, classes of 
vertices, sparsity and degree distributions. 

2 Outline of the Method and Results 

As with Mansinghka et al. , we consider situations where there is a natural, but unknown, partition 
of the variables into classes and the probability distribution over the variables may be expressed as a 
Bayesian Network where the DAG only has the possibility of an edge x ^ y if and only if x belongs to 
a class of strictly lower order than class y. This is an ordered block model. Kemp et al. introduce a 
prior, which we call the Hoppe-Beta Prior, which is a joint distribution over classifications and graphs. 
In our case, we make the further assumption that this partition of the variables corresponds to the 
minimal layering of the DAG of the Bayesian Network. 

The Hoppe-Beta Prior We describe the Hoppe-Beta prior; this is the name we give to the prior 
introduced by Kemp et al. Q] . The description also explains how to sample from the distribution. Each 
step is computationally straightforward and uses natural tools. The distribution has three parameters, 
which may be used to control the sparsity and consistency of the sampled graphs; consistency will be 
defined later. The algorithm can be divided into two main steps: 

Step 1 In the first step the nodes are partitioned into numbered classes using the Hoppe-Ewens urn 
scheme Hoppe B(1984). The prior over classes is therefore constructed first, without reference to the 
prior over DAGs. 

Step 2 The second step is the prior over DAGs conditioned on the classification. Firstly, the edge 
probability from a class o node to a class b node is generated using a suitable Beta distribution, these 
are independent for different pairs of classes and only edges from lower to higher classes are permitted. 
A DAG is then generated using these probabilities; conditioned on the random variables generated by 
the Beta distributions, the indicator variables for edges between pairs are mutually independent. 
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The Minimal Hoppe-Beta Prior We introduce a new prior, over graph structures, which we 
call the Minimal Hoppe-Beta Prior. As with the Hoppe-Beta prior, we first generate a class vector, 
according to the Hoppe-Ewens urn scheme. We then ensure that this structure provides the minimal 
layering for a DAG. That is, a node in class Ci has at least one parent in class Cj_i. We first produce 
a skeleton; for each node in a class Cj, choose one node at random (each with equal probability) 
from class Cj_i and add an arrow. We call these compelled edges. This skeleton graph ensures that 
the classification from the Hoppe-Ewens s cheme provid es a minimal layering. Then we decide on the 
remaining edges using the same scheme as iKemp et al.l with the Hoppe-Beta prior. 

We are able to obtain a convenient closed form expression for our prior. It has the advantage that 
it is a prior only over graphs. The graph structure implies a class vector, which is the minimal class 
vector. If one wishes to infer class structure from a DAG, then the minimal layering represents as 
much of the class structure that can be inferred from data alone. 

There are other possible choices of class structure other than minimal layering. The point is that 
data influences the class structure only through the update on the graph structure. 


The Posterior Distribution The nodes of the graph are random variables (Xi,..., X^). Given an 
nx d data matrix x of n instantiations, the Gooper-Herskovits likelihood is used to give a likelihood 
function for the graph structure given data. Given the data matrix, this likelihood only depends on 
the graph structure; the data influences classification only through the graph structure. This is true 
both for the Hoppe-Beta prior joint distribution over class /graph structure (where the distribution 
over classes given the graph does not change with data) and for the Minimal Hoppe-Beta prior over 
graph structure (where data updates the distribution over graphs, and the minimal layering is chosen 
for the class structure). 


Monte Carlo Methods for the Posterior A Gibbs sampler is considered for the posterior dis¬ 
tribution, but this turns out to be rather slow and wrongly classified nodes have difficulties changing 
class. To find the Maximum Aposteriori Structure (MAPS), a stochastic optimisation algorithm is 
used. The moves are based on the Hoppe-Ewens urn scheme for moving between classes, along with 


some addition / deletion of edges. A proposed move x 


y is accepted with probability min 



where § is the score function. 

We compare inference from the posterior for three prior distributions: the uniform prior over 
graph structures, the Hoppe-Beta distribution where we consider the graph structure and the Minimal 
Hoppe-Beta prior. 


3 Block Structured Priors 

We now describe the Hoppe-Beta distribution over classification and graph structures. 

Let X = {Xi ,..., Xd) denote d nodes of a graph. The indexing set is (1,..., d). Each node belongs 
to a class, where a priori the assignment of nodes to classes is unknown and the number of classes 
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is also unknown. Let z = {zi, Z2, ■ ■ ■, Zd) be the class assignment vector, , where Zi = j means that 
variable i is of class j; the classes are labelled by the positive integers. 

For the ordered block model, a DAG represents direct influences between the nodes. The nodes are 
also classified and all the arrows of the DAG are from nodes of a lower class to nodes of a higher class. 
For the prior distribution, firstly a classification vector is generated via a Hoppe-Ewens urn scheme and 
then a DAG is generated based on the class structure. The classes represent a hierarchical structure. 
Therefore, the classification vectors = (1,2,2) = {{l}i, {2,3)2} and = (2,1,1) = {{2,3}i, {1}2} 
give different classification structures; for the DAG, directed arrows will go from variables of a class 
of lower index to variables of a class of higher index. We shall call the sets in a partition cells, layers, 
classes or colours. We denote a permutation of d elements in the Cauchy 2-line notation. E.g. the 
permutation p defined by p{l) = 2, p{2) = 3, /9(3) = 4, p{‘i) = 1, is denoted by ( 2 i 4 1 )• 


3.1 Prior Distribution over Node Classification 


Let z = {zi,... ,Zd) denote a class assignment vector, generated by a Hoppe urn model. The nodes 
{!,... ,d} are introduced one by one, in order lowest to highest. An urn initially has an orange ball 
of weight a > 0. 

The size of a will influence the number of classes; the smaller a the fewer classes. If a = 0, then 
all the variables will be in a single class and hence the resulting DAG will be the empty DAG. 

Each ball added to the urn has unit weight. At the nth selection, we draw a ball at random, in 
proportion to its weight, from the urn. If we draw the orange ball, then we put it back, together with 
an additional ball of a colour that has not yet present in the urn. This new colour is the ‘value’ of Zn- 
The colours are numbered according to the order in which they were introduced to the urn. If we do 
not pick an orange ball, we put the ball back, together with another ball of the same colour and, in 
this case, this is the value of Zn- 

The orange ball takes the label 0; note that none of the items introduced are placed in class 0. 

Firstly, node 1 is assigned to class 1 (a colour different from orange). 

Assume that nodes l,...,j have been assigned to classes and that there are now a total of kj 
colours different from orange. For {zi,... ,Zd) e ,d}'^ we set 


where 


Set 


ml^\zi,..., Zd) = '^li{zi), i = l,...,d, 
1=1 


li(a:) = 


1, X = i 
0, X i. 


kj 

Thus counts the number of nodes from {!,... ,j} in cell i, so that = j. 
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(2) 



Generation The classification vector (zi,... ,Zd) is considered as the outcome of a random vector 
(Zi,..., Zd)- The algorithm begins with zi = l and proceeds as follows: For j e {1 ,..., d - 1}, 


+ Iki,--- ,Zj) 


a 


a + j 


Z'.! 

m- 


ViZu-Z, • • • > = -j- i = l,...,kj. 


a + j 


Let 


K := max 2 : 


'J- 


SO that K is the total number of classes. 


( 3 ) 


Expected number of cells 


Let Kd be the number of cells generated by Hoppe’s urn scheme with d balls. Let to be the indicator 
function of the event that a new class is created in round i, i = 1,... ,d. Then P(/j = 1) = IE[/j] = 

Oi 

-for i = 1,..., N. The expected number of classes is: 

a + i - 1 


Note that 


E[Kd] = E[Y^I,] 


2=1 


a a a 

^ + • • • + . 

a a + 1 a + d-1 


( 4 ) 


a 


a 

— +- 

a a + 1 


a 


H— + 


= aT-Ld{a) 


( 5 ) 


a + d - 1 

where Tid is the generalised harmonic number. It is straightforward to compute that for any fixed a, 
T~Ld{o:) ~ ln(d) as d ^ 00 . Thus 

nKd] 

lim ——— = a. 

d ^+00 ln(a) 


3.2 Prior over Graph Structure, Given Class Structure 

For unlabelled classes, the distribution of Z is exchangeable. The same effect is achieved by randomising 
the order of the classes. Let R be the space of possible class permutations; if there are K classes, then 
/9 is a permutation of {1,..., iL}. The conditional distribution of p, conditioned on Z is: 

^r\z{p\z) = yv 

When constructing the DAG, we only permit edges from nodes in a class a to nodes in a class b if 
p{a) < p{b). Conditioned on the classification vector 2 : and the class ordering p, edges are mutually 
independent of each other. Firstly, we randomly generate the edge probability and then, for each pair, 
the existence of an edge is the outcome of a Bernoulli random variable. 

For (a, 6) e {1,..., iL}^, define the density fa^ as follows: 
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fa,b{x) 

6o{x) 


1 


(/3a,b;l) Pa,b-,2 ) 

0 , 


^0a,b'A ^(1 _ ^ 


0 < X < 1 

otherwise. 


a < b 
a> b 


( 7 ) 


where for each a<b, Pa,b-,i > 0 and Pa,b -,2 > 0. If /3a,b-,i = 0 then fa^b = <^0 and if Pa,b-,2 = 0 then fa^ = 5i. 
Here is the beta function defined by: 


B{x,y) 


r(x)r(y) 
r(x + y) ’ 


where r(-) is the Euler gamma function. 

Let rj = (tjafi ■ a,b € {1,..., K}) be a random matrix of edges probabilities where Tja^ are indepen¬ 
dent random variables and, for each (a, 6) e {1 ,..., K}^, rja^b ~ fa,b- Let S denote the dxd matrix with 
entries Ci,j = dpizOA^B' 

Let G denote the edge set of the graph. This is the dxd matrix with entries 


^ I 1 edge i ^ j present 

1 0 edge i ^ j not present 

Conditioned on the matrix S, G is the outcome of a random matrix Q where the entries of Q are 
independent and Gij ~ Be(^jj) (i.e. a Bernoulli trial with success probability (,ij). 

Then the (prior) probability of the DAG G given a partition z, class permutation p and the edge 
probabilities ^ is found by: 




d d 

nne 

x=l y=l 


Gx,y 


(1 ^x,y) 


1-Gx 


( 8 ) 


with the convention 0^ = 1. It is clear from the construction that d is the edge set of a DAG. Note 
that we take 0^ = 1. 

Mansinghka et ah and Kemp et ah 0] restrict attention to the situation where Pa,b-,i = Pi and 
Pa,b-,2 = p2 for all 1 < o < 6. The nature of the graph prior is therefore controlled by three parameters, 
a. Pi and P 2 - 


While leading to computational convenience, removing the dependence of Pi and P 2 on a and b 
means that edges are equally likely between a node and any other of a higher order. The expected 


value is 


Pi 


Pi + P 2 


and the variance 


P1P2 


{Pi + / 32)(1 + Pi+ P2) 


A lower 


Pi 


Pi + P2 


leads to a sparser graph. 


Another situation of interest, which is computationally convenient, but which is only valid when it is 
known a priori that class i only influences higher order classes via class i + 1, is the model where 


fa,b{x) = So{x) b*a + l (9) 

and Pa,a+i;i = Pi, Pa,a+i;2 = p2- In this situation, each class has at most one adjacent class; for variables 
in a given class edges from these variables are only possible to variables in the adjacent class. 
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A computationally convenient setting, which gives flexibility, is to consider two setting: = 

for i = 1,2 and I3i-jj+k = (3i,2 loi k > j + 2, i = 1,2. The parameters are chosen so that the edge 
probability from j to j + 1 is higher than j to j + k where k> 2. 


Example 3.1. 

In Figure 2a||Figure 2d a possible outcome of the algorithm is shown in 4 steps. Here d = 8, 

Z=(ll 2 3 3 3 2 l), p=({ii). 


Figure 2a shows the nodes. Figure 2b shows the partition of the nodes generated by Hoppe’s urn 


scheme, where the colours indicate the different cells in the partition. Cell 1 is coloured in red, cell 


2 is coloured in black and cell 3 is coloured in blue. Figure 2c, shows the new order of the cells, p. 


Figure 2d| shows a possible graph under these conditions. 

@@@@©©@@ 


© @ © 
© @ 

@ @ © 


(a) 


(b) 



3.3 Joint distribution of the DAG and the partition 

The method for generating DAGs described above enables us to derive the joint distribution of the 
classification / DAG {Z,Q), where Z is the random classification vector and Q the random DAG. 


3.3.1 Probability of the Partition 

The probability function of a partition Z is given by the following well known theorem found in 
We need to define the space of possible z vectors: 


Sd = {z ^ {1,... = 1,1 < Zj < max Zk + l'-2< j <d}. 

l<k<j-l 


( 10 ) 


We set 


def z-.d 
TUk = rUf, , 


( 11 ) 


which is the occupation number for cell k after all d nodes have been assigned to cells. Then we have 
the following result. 
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Theorem 3.2. Let K be the number of non-empty cells and 


= ] z e {1,..., = 1,1 < Zo < max Zk + l-2<j<d\. (12) 

( J 

that is Sd is the set of possible values of z. The distribution of the random vector Z for a given 
parameter value a e M+ may be computed explicitly and is given by 


Fz(z\a) 


.K 


a 


r(a) 


K 


T{d + a) 


K = 1,... ,d,z € Sd 


(13) 


0 , otherwise. 

Proof. Consider the probability of a sequence z € Sd (so that zi = 1) with occupation numbers 
mi,... ,mx; the number of classes is K. The probability factorises as: 


d-l 


i=i 


From Equation ([3]), it follows that 


,, a^nf=inzr'^ 

^;z(z) = .. -=- 


nU(a + j) 


nU(a + j) 


mi-1 d-l 

where / = 1 if m* = 1. Finally using F(/3 + 1) = /3F(/3), it follows that F(q: + d) = F(a) ]~[(a: + j) 
1=1 j=o 

from which the result follows. □ 


3.3.2 The Marginal Prior Probability of the DAG given the Partition 

Marginalising over H in = '^■E\z,BFg\E,z gives the following: 

Theorem 3.3. 


Fg\R^z{G\p,z) 


n 

l<p(a)<p{h)<K 


(/^l;p(a),p(6) ^a^bi h-.,p{a)^p{h) 

B{Pl-,p(a),p(b) 1 h-,p{a),p{b) ) 


(14) 


where Nafi is the number of edges between cell a and cell b and Mafi is the corresponding number of 
missing edges. 


Proof. The details of the computation are given below: 
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^g\R,z{G\p,z) 

= £lF’6;|s,z(G|C,2)7rH|ij,z(C|p,5;)rf? 


fUU d7i^-^x,y)^~''^’^7r^lR,z{^\p,z)dC 

x=l y=l 

f' 1 

'dp(a),p(b)^^ ~ dp(a),p(b)) fp(a),p(b){'>lp(a),p(b))dpp(a),p(b) 

/3l:p(a),p(6)-l/i )P2:p(a),p(b)-i 

‘p{a),p{b) ' •pi^)ipi^) ^ 


l<p{a)<p(b)<K ' 




l<p(a)<p{b)<K 


^{Pl:p(a),p{b) ? P2:p{a),p(b) ) 


dPp{a),p{b) 


,n J 


/3l:p(a),p(6) 1 + \P2-.p(a) ,p{by 

'p(a),p(6) ' 'p(^))p(^)'' 


1 + Afn. 


l<p(ci)<p(6)<i<' 

n 

l<p(a)<p(b)<K 


^iPl-,p(a),p(b) ) /^2;p(a),p(fe) ) 
(/5l;p(a),p(6) -^a,6)/^2;p(a),p(fe) ^a,b') 

(/5l;p(a),p(fe) 5 /?2;p(a),p(6) ) 


-dpp(a),p{b) 


where, is the number of existing edges between class o and class b (provided p(a) < p{b)) and 
Ma^b = RRaRT'b-Na,b IS the number of missing edges, rua and mb denote the numbers of nodes in classes 
a and b respectively. Recall that ^ij = Pp(zi),p(zj)- D 


At this point, the Minimal Hoppe-Beta Prior (which we derive later) has a distinct advantage over 
the Hoppe-Beta prior of Mansingkha. To marginalise over p, some additional assumptions are needed. 
If we set Pi = ^i-a,b for all a < 6, i = 1,2, the right hand side of Equation (fTTl) is the same for any valid 
partition of the cells for which the DAG is possible. The conditional marginal probability of a DAG 
G given a partition z is therefore given by 


^g\z{G\z) - ^ Fg\ji z{G\pi,z), (15) 

where g{G, z) is the number of permutations of the cells that are compatible with the DAG G and pi 
is one such permutation. This is self evident: 


^g\z(G\z) - J^^g\R,z{G\p,z)Fji\z{p\z) 

p^R 

= 'li^g\R,z{G\pi,z)— = ^Fg\piz{G\pi,z). 

i=l ^ ^ • 

The quantity g{G, z) is the number of topological orderings of the nodes in G and is discussed, along 
with an algorithm for computing it, in Li et al. 0]. 

Now, with slight abuse of notation, let Nafi = ^p(a),p{b) Ma^b = l^p{a),p(b)] i-®- ^ and M denote 
the numbers of included and missing edges between the classes after they have been listed in order 
/9(1 ),... ,p{K). Putting these marginalizations together, it follows that for Pi-.afi = /3i and j32-.a,b = 1^2, 
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(16) 


m 9{G,z) T-r B{/3i+Nafi,(32 + Mafi) ^ r(a) 

^g,z{G,z)= 11 - -a p/X li(mfc-l)! 

l<a<b<K B[^Pi,(j2) l \d + a) 

where g{G, z) is the number of orderings p of the classes 1 ,..., X that are compatible with the DAG 
G. 


Note Consider the DAG in Figure [3l Suppose the partition is Gi = {1}, C 2 = {2}, G^ = {3}. If 
A;a,f) = A: i = 1,2 for all a < 6, then the edge probability 1 1 -^ 3 is the same whether the classes appear 
in the order Ci,G 2 -,Cz or C' 2 ,Ci,C' 3 . If /3j:i,3 * /3i:2,3 then the order makes a difference. 


4 The Minimal Hoppe-Beta Prior over Graph Structures 

We now present a new prior distribution over grap h structures, whic h we call the Minim al Hoppe-Beta 


Prior This is based on the Hoppe-Beta prior of iKemp et al 
features that are more convenient. 


and 


Mansinghka et al 


but has some 


4.1 Hierarchical Graph Drawings 

A hierarchical graph drawing or layering of a DAG Q = {V,D) is a partition of the nodes in numbered 
layers such that all edges are directed from nodes in layers of lower rank to nodes in layers of higher 


rank, c.f. Healy and Nikolov [^. This is a style of graph drawing for visual understan ding of hierarchical 
relations. There are several algorithms for achieving this, as surveyed in Tamassia [lit]. 

The layering of a DAG is not necessarily unique. Consider the DAG in Figure [3l There are 3 
possibilities for layering this DAG: 



• Layer 1: nodes 1 and 2, Layer 2: node 3. 

• Layer 1: node 1, Layer 2: node 2, Layer 3: node 3. 

• Layer 1: node 2, Layer 2: node 1, Layer 3: node 3. 

While node 3 is always in the highest layer, there is some ambiguity with nodes 1 and 2. They can 
be either in separate layers or in the same layer. 
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When inferring classification, the DAG structure alone cannot distinguish between these possibil¬ 
ities. For our prior over graph structure, we assume that the layering is minimal. For the 3-variable 
DAG of Figure [3l the minimal layering is the first of the possibilities listed. 

We call the layering minimal if 

• it has the smallest possible number of layers for the DAG under consideration and 

• among layerings satisfying this criterion, the nodes are in layers with as low a rank as possible. 
The minimal layering represents, in some sense, the class structure that can be inferred from the DAG. 


4.2 Outline of the Minimal Hoppe-Beta Prior 


For the Minimal Hoppe-Beta Prior, we use the Hoppe-Ewens urn scheme to generate the classification, 
just as Kemp et al. |^. The difference is that this classification corresponds to the minimal layering of 
a DAG and we choose a minimal number of edges at random to form a skeleton] namely a DAG with 
a minimal number of edges to ensure that the classification is a minimal classification for the DAG. 
We then add in additional edges randomly according to the approach of Kemp et. al. 


Step 1 For nodes l,...,d, generate a class assignment vector according to the Hoppe-Ewens 


urn scheme. Using ^ \i{zk) and let Kj = max{f: ^ 0} 

k=l 


Ti) 


¥{Zj = i\Zi,.. .,Zj.i) = 


m 


0 - 1 ) 


j -1 + a 
a 

j -I + a 


i — 1,..., Kj—\ 
i = Kj-i + 1. 


Step 2 Let K = K^. This is the total number of classes. Let p be a randomly chosen permutation 
of (1, ..., A'); conditioned on K classes, each p is chosen with probability —. The permutation 
represents the ordering of the layers, from lowest to highest. 


• Step 3 Let K be the number of classes. For each j = 2,... ,K, for each node v e (where 

Ci denotes class i) choose a node w randomly from those in class (each with equal 

probability). Add in the directed edge w ^ v. 

After stage 3, a skeleton graph has been produced. This is a graph whose minimal layering 
corresponds to the classification generated by 2 and p and a graph with the minimal number of 
edges necessary to have this property. 


• Step 4 The remaining edges which are not in the skeleton are added according to the scheme 
outlined by Kemp; for p{i) < p{j), an edge probability ^ij is generated according to a Beta 
hp{i),p{j)) distribution. This is the edge probability between class i nodes and class 
j nodes when p is the class permutation. The random variables and ^* 2,72 independent 
for (n,ji) * (f 2 ,j 2 ), p{ii) < p{ji) and pin) < p(j 2 )- 
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Example 4.1. 


In Figure 4'^Figure 4d| a possible outcome of the algorithm above is shown. The setup is similar to 
that in Example 13. II with 4=8, 


z = (l 1 2 3 3 3 2 1 ), p={\ll). 

The additional step in this example is step [Figure 4c| where the skeleton is created. 


© @ @ 

@ @ 

@ © ® 

(a) 



Figure 4- Figures a)-d) show the steps in the minimal Hoppe-Beta prior 


4.3 A Formula for the Minimal Hoppe-Beta Prior 

Let G be a DAG on d nodes whose minimal layering has K layers, with mi ,...,rriK in each, the layers 
ordered from lowest to highest. 

In our scheme, a DAG implies a single class structure (allocation of objects to classes and the order 
of the classes). The above generation scheme gives the probability the DAG as: 


PeCG) = 


1 - 1)! + Ajj+i rrij, l32-jj+i + Mj-j+i) 


n 


K'- nti(a + j-l) 1=1 
^ \\ fi /^2;i,i + ^j,i) 


^(/?l;i,i+l,/32;j,i+l) 


1=1 i=j+2 


-B(^l;j,j,/32;j,i) 


(17) 


1 


Here is the probability of permutation p of the classes 1,., 


T.- (mi - 1)! . , 1 ,-I- 

, K, — . ^ -IS the probability 

ni=i(a + i - 1) 

of the class assignment vector according to the Hoppe-Ewens urn scheme, Na^b denotes the number 
of edges between class a and class b nodes, while Ma^ = ruamb - Nab denotes the number of missing 
edges. The first term follows from forcing each node of class j to have at least one parent in class 
j - 1; the last term follows because there is no such forcing between different pairs of classes. 

The formula follows from using 


P(DAG) = P(DAG|skeleton)P(skeleton). 

skeleton 

For a given layering (choice of classification vector z and p - ordering of the classes; rrij denotes number 

K / I y 

in class p{j)), all skeletons have the same probability FT -| 

i=2\”^i-i/ 

layering have the same probability. 


and all DAGs, given skeleton and 
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4.4 Properties of the Distribution 

Having declared that we are restricting ourselves to the minimal layering, so that our prior is purely 
a prior over graph structures (and the graph structure implies the minimal layering), we can now 
proceed to present some straightforward properties of the distribution. 

Firstly, the probability of the empty graph is clearly equal to the probability that there is exactly 
one class. This is: 


P„( empty graph) = — --. 

n|=2(j - 1 + a) 

Clearly, as a ^ 0, Pq( empty graph) ^ 1. 

The case of I5i-^a,b = A for all a < 6, i = 1,2 When A;a,f) = AA = 1)2 for all a < b, it is possible to 
compute some reasonably straightforward properties of the prior. In this case, the prior is a function 
of three parameters; a,/3i and /32- One convenient measure of sparsity is to consider the expected 
number of edges and compare it to the total number of possible edges for a DAG on d nodes, which 
is -d{d - 1). 


E [edges] 



)(d-EK,.,]) + (^) 


/3i + A' 
/?! 


E 


K-l K 

E E 

2=1 j=i+l 


Pi + A 


)(rf_EK,.,])A(^)(/-E[|™?]) 


This comes from the following: firstly, each of the d nodes has at least one parent (a compelled 
edge) for the skeleton, except for those in the lowest layer. Then there are the remaining edges. 
For p(j) = p{i) + 1, consider the non-compelled edges which are added in, each with probability 
———. For the second line, only nodes within the same block cannot have an edge. Recall that 

A + A 

E[A] a\\i{d). Now suppose that a{d) varies with d and that a' ■= a{d)ln(d) is kept constant. 


1 


K 

Ew 

Li=l 

which satisfies /(O) = 1 and /(+oo) = 0. It follows that 


For fixed a' ■= a{d) ln((i) > 0, can be shown that lim —E 

d^+oo 

function defined on 


= f{a') where / is a decreasing 


E [edges] 




lim , . 

d^+oo^d{d-l) VA+A 

This may be considered as a sparsity index, since —d(d-1) is the maximum number of possible edges. 
From this expression, it is clear that there are two parameters for controlling the sparsity. Firstly, low 
values of ^ j lead to a sparse graph. Secondly, low values of aln(d) lead to a sparse graph. 

As discussed in lMansinghka et al.l . the role of /3i + A is also of interest. While the expected number 
of edges only depends on A and A only through ^ ^ , the sum of A + A provides a consistency 


parameter. If /3i + A is small (with 


A 


Pi + A 


/?! + A 

= 0.5), the outcome of the random variable ^ij, which 
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has Beta(/3i,/ 32 ) distribution will typically take values either close to 0 or close to 1 . This prior will 
therefore generate graphs which either have many edges between a chosen pair of classes or few edges 
between a chosen pair of classes (while the average edge probability is 0.5). If /3i + P 2 is large, the edge 
probabilities between chosen pairs of classes will be similar. 


5 The Posterior Distribution and Monte Carlo Methods 

Now suppose that the d nodes of the graph are random variables {Xi,... ,X(i) and an n x d data 
matrix x containing n independent instantiations of the d variables under study (identified as nodes) 
is given. 

Let X denote the random matrix from which x is an observation. The aim is to infer a DAG along 
which the probability distribution factorises and the class structure, which is the minimal layering of 
the DAG. 

Cooper-Herskovits Likelihood and Posterior Our assumption is that once the graph structure 
is known, the class structure (or layering) gives no further information. That is: 


This may be evaluated explicitly and is well known as the Cooper-Herskovits likelihood, derived in 
[it, given by: 


wxiG') = nfi 




j=ii=i T{n{Tr) ’) + Ziiiljii) *=i 




Here (x 


( 1 ) 


, x^J’^ ^) is the state space for variable Xj , while (vrj^'’,..., '’) is a listing of the possible 


( 1 ) 




(18) 


parent configurations for variable j in the Bayesian network. The parameters {'Jju ■ j = 1,... ,d;i = 
1,... ,Pj;l = 1,... ,qj) are hyperparameters that can be chosen depending on prior information. We 
take ')jii = 7 > 0 all equal, so that there is one free parameter 7 for the Cooper-Herskovits likelihood. 

In the algorithms described below, computational savings are made if only a small part of the 
graph needs to be considered. Suppose only one edge at a time is changed. Let Gij = 1 if there is 
an edge i ^ j and 0 if there is no edge i ^ j. Let (xj,... ,x^*) denote the state space of variable Xj. 
Let G~ denote a DAG where Gij = 0 and let G^ denote the graph G~ with Gij replaced by Gij = 1. 

Assume that G^ is a DAG. Let qj+ denote the number of parent configurations for variable j in G^ 

P;^|g(x|G+) ^ 

and qj- the number in G . Note that qj+ = Piqj-- The ratio - - —[ 777 - may be computed, from (|18]1 . 


as: 


IPa'|£;(x|G' ) 


Pv|s(x|G") 

P;r|e(x|G-) 


/% r(pj7 + n(7r^.^'^)) K K + 

\r(7)^^ / ;=i nfclir(pj7 + n(7r:^'\a;p^)) k=ia=i Hfcli r(p^-7 + n(7r:^'\ xj“^)) 


(19) 
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where '^) is an enumeration of the parent configurations of variable j in graph G~. 

This formula only depends on the variable Xj, the parent set PaJ of Xj in graph G~ and the additional 
variable Xi. 

We tried two algorithms; a Gibbs sampler and a stochastic optimisation algorithm. These two algo¬ 
rithms have different objectives. The aim of a Gibbs sampler is to generate an empirical distribution 
which approximates the posterior distribution. This is useful for exploring properties of the posterior. 
The stochastic optimisation algorithm simply looks for the maximum aposteriori structure. 

While the Gibbs sampler is theoretically ergodic, convergence was very slow. The main difficulty 
was that nodes in the wrong layer had difficulty bubbling up to their correct layer. 

With this in mind, the stochastic optimisation algorithm was constructed to ensure mobility be¬ 
tween layering. The moves were constructed by choosing a node at random and re-assigning it ac¬ 
cording to to a Hoppe-Ewens urn scheme. 


5.1 Gibbs Sampler 

We now describe the Gibbs sampling scheme. For the Minimal Hoppe-Beta posterior, there are d{d-l) 
variables; Variable Gij is a binary variable taking value 1 if and only if the graph 

G has a directed edge i <->■ j. 

We consider a Gibbs sampler, working through these variables one by one, conditioning on all the 
other variables. Let V = (Vi,..., V^(d-i)) represent an enumeration of the binary variables {Gij)ii,j. 
Denote the ith sample by The algorithm proceeds as follows: 

Initialisation Let {G^^^) be the initial condition, where G^^^ is the empty graph. 


Sampling To generate a random sample of size fc, for each sample i e {1,... fc}, do the following: 
• For j = 1,..., d{d - 1), sample from the conditional distribution 




(0 Xi-l) 


,x 


(i-1) 

d(d-l) 


) =: 


( 20 ) 


where X-j = (Vi,..., Xj.i , Xj+iXd(d-i )) • 

The probability Pg from (I17p together with the Cooper-Herskovits likelihood gives the posterior: 


Pg|x(G|x) oc Pg(G)Px|g(x|G) =: S(G|x) (21) 

If Xj to be sampled is a graph edge variable Gxy then it takes value 0 with probability 1 if the graph 
with an edge x i->- y is illegal (i.e. leads to a directed loop). Otherwise, aj{Gxy = 1) is computed from 
Equations (fT7|l and (fTOjl . 
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5.2 Stochastic search algorithm 


In this section we present the stochastic search algorithm used for maximising, at least approximately, 
the posterior score function defined in Equation (I21jl . The algorithm generates a non-reversible Markov 
chain of DAGs started from an initial DAG A move from site G^^^ to site is 

then made by proposing a move to G' according to the distribution where the transition 

kernel Q is described below and then accepting the move with probability 


cxQ(t) Qt — min 


’ S(G(*)|x)J 


( 22 ) 


The Proposal Kernel Q For each DAG in the chain we let denote the corresponding 

layering. The algorithm is initiated with the empty graph G^^\ thus the minimal layering of 
G(o) is a vector of zeros. We proceed as follows: 

1. Let J the minimal layering of G^*^ and call = T. Choose a node x at random and 

remove it (each node with equal probability). If the chosen node was in a layer of its own, 
decrease the order of each layer with order > j, that is set 2 new = 2 ^^ - 1 for all 'zk such that 
Zfc > j and then set S' = Tnew so that there will be no empty layers. Set T:= S'new- 


2. Let K be the number of layers left. Now re-assign the node x in the following way: put it in 

s 

a new layer, {K + 1) with probability ——-- and in layer j for 1 < j < AT with probability 

a + (a - 1) 
rrij 

——-- where m,- is the current occupancy number in layer j. In other words, the node is 

a + (d - 1) 

reassigned according to a Hoppe-Ewens scheme with parameter a. 

3. If X is now in layer AT + 1, let m be chosen randomly according to the uniform distribution over 
{1 ,..., K + 1 } and set: 


Zj = Zj, j ■ 1 < Zj < m - 1 

' Zi = m, j - j = i (23) 

Zj = z'j + 1, j -■ m + 1 <'zj < K 

That is, the new layer is placed in position m and the indices from m to K are pushed one place 
to the right to compensate. This process has generated a new partition vector z. 

4. To construct a proposal G', let G = G^^K The minimal layering, z' for G' will be derived from z 
constructed above. 


(a) Remove those edges contradicting the partition z (i.e from class a to class b for b < a). 

(b) If Zx > 2, if X does not have any parent in layer Zx - 1, add a single compelled edge (p,x) 

where p is randomly chosen from layer Zx -1, each with probability-. 

Zx - 1 
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(c) If Zx = 1, add X as a parent to each node in + 1. 

(d) Steps (a), (b), (c) have generated a DAG G. Let Znew denote the layering of DAG G. This 
differs from z only if there are children of x in graph G for which x was the only parent in 
G. For such nodes, Znewc = max Zp + 1. Here Pa^(c)\{x} denotes the parent set of 

^ ’ pePag(c)\{a;} 

c in graph G without x. This process continues recursively for the children of x until every 
node in the graph is in it minimal layer. 

5. Let z denote the current partition vector (z from step 3 modified by step 4 (d)). Now choose 
two nodes y and w at random. If Zy> Zw, do nothing. Otherwise, toggle the edge between y and 
w (i.e remove it if exists and add it if it does not exists). Let G' denote the resulting graph. 

6 . Let z' be the minimal layering of graph G'. If {y,w) was removed, then place w in its minimal 
layer as in step 4 (d). 

This process generates a proposal DAG G' with minimal layering z^ 

Alternatively, the layering z from step 3 could have been taken as the minimal layering of the graph, 
with step 4(d) replaced by adding in a minimal number of edges to ensure that children c of node x 
were in the appropriate class. The rejection rate was higher with this approach. 

6 Simulations 

6.1 Data generation 

The simulation studies were made on random samples from the HEPAR II network shown in [Figure Jj 
All the variables are binary and the parameter in the conditional probability tables where indepen¬ 
dently sampled from a Beta{0.5, 0.5)- distribution and were then adjusted so to ensure that they lie 
in the range (0.1,0.9). 

6.2 Simnlation resnlts 

The stochastic search algorithm with the three types of priors (uniform, Hoppe-Beta and minimal 
Hoppe-Beta) was tested on 500 samples from the HEPAR H network shown as an adjacency matrix in 
Figure 5] E|. For the minimal Hoppe-Beta and the Hoppe-Beta prior, the hyper parameters were set to 
= 2, ; 02 ;i,i+i = 1 for i = 1,..., a: - 1 and = 1, l32-i,j = 2, for i = 1,..., A - 2 and K>j>i + 1. 
This reflects the prior knowledge that between layers i and i + 1, the graph is denser, while between 
layer i and j where j > i + 1 the graph is sparser. For the sparsity parameter we used a = 1. For all 
the three priors we used 7 = 1 and a = 1 . 

'^The same study were performed on 10 different datasets showing similar results, for that reason, results from only 
one representative dataset are considered here. 
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As a measure of goodness of fit we use the sensitivity (TPR) and specificity (SPC) defined as 
follows 


TPR 


def 


Number of edges correctly identified 


and 


SPC 


def 


Number of edges correctly identified + Number of edges falsely rejected 


Total number of edges in the skeleton 


Total number of edges in the skeleton + Number of edges wrongly included 

Here, the skeleton of a directed network means its undirected version. 

We ran 10 trajectories for 5000 iterations for each prior and the results are summarized in IXableTI 
and in Figure 6 Figure 8 Since the Hoppe-Beta prior is joint prior over partitions and DAGs, we used 


this joint score when evaluation this method. The mean and standard error for the TPR and SPC 
taken over the 100 best scoring graphs among the 10 chains are found in ITableTl As seen the minimal 
Hoppe-Beta prior shows the best results in terms of SPC and TPR among the methods. 


Read from the top, Figure 6 - Figure 8 show in the left columns, the trajectories for the prior, the 
Cooper-Herskovits likelihood and the scoring function. The black line in each plot is the corresponding 
function evaluated at the true HEPAR H network. The right column shows heat maps over: the top 
100 scoring DAGs among all 10 trajectories, the single top scoring DAG among the 10 trajectories 
and the top scoring graph in each trajectory. Looking at the heat maps for the top 100 scoring graphs 
from the 10 chains, we see that we get the clearest results with the minimal Hoppe-Beta prior. 


Prior 

TPR 

SPC 

Uniform 

Minimal Hoppe-Beta 
Hoppe-Beta 

0.77 / 0.09 
0.85 / 0.10 
0.80 / 0.19 

0.79 / 0.07 
0.86 / 0.06 
0.84 / 0.12 


Table 1: The results reported are mean values and standard error (mean/std. error) taken over the 100 best 
scoring DA Gs among the 10 trajectories of the stochastic search algorithm ran for 5000 iterations each. 


HEPAR II 



Figure 5: The HEPAR II network shown as an adjacency matrix. 
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logP(x|G<‘>)P(G<'>) logP(x|G(‘)) 


Prior: min-hoppe-beta, data points: 500, a =1, =2, =1, ^i\i,j>i+i =1, l32\i,j>i+i =2, n =1 
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Figure 6: Results from 10 stochastic search trajectories with the minimal Beta-Hoppe prior ran for 5000 itera¬ 
tions each. 
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Prior: unif, data points: 500, a =1 
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Figure 1: Results from 10 stochastic search trajectories with the uniform prior ran for 5000 iterations each. 
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Prior: hoppe-beta, data points: 500, a =1, =2, =1, =1> p 2 ’,i,j>i+i =2, a =1 







Figure 8: Results from 10 stochastic search trajectories with the Hoppe-Beta prior ran for 5000 iterations each. 


7 Summary and Conclusions 


This article considered the Ordered Block Model introduced by Kernp et al. which was applied 
to Bayesian Networks for multivariate data by Mansinghka et al. [7|. We introduce a new prior 
distribution over graph structures, which is a modification of the Hoppe-Beta prior introduced by 
Kemp et al.l . The probability distribution over graphs has an explicit closed form. The parameters can 
adjusted to determine sparsity and consistency] consistency refers to the similarity of edge probabilities 
between nodes of different pairs of classes. We call t his prior the minimal Hoppe-Beta. It builds on, 
and represents an advantage over the Hoppe-Beta of iKemp et al.l . which is a joint prior over graphs 
and classes. With the minimal Hoppe-Beta, the class structure is implied by the graph. 

The prior is then tested experimentally; a posterior is obtained via the Cooper-Herskovits likeli¬ 
hood. We run a stochastic opti misation sche me and compare the output with three different priors; 
the uniform, the Hoppe-Beta of iKemp et al.l and the new Minimal Hoppe-Beta. 

The Minimal Hoppe-Beta compares favourably. 
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